PIVAJ: an Article-centered Platform for Digitized Newspapers
Abstract
PIVAJ is a platform for archived digitized newspapers that emphasizes articles: it extracts them from digitized documents by automated page layout analysis, runs OCR on them, and indexes their text transcription so that users can search for content. Crowdsourcing is used to improve the quality of the indexing, both by correcting the transcription and by tagging articles with keywords. The platform has been used to give Web access to 550 000 articles generated from a digitized local newspaper. Current developments include further improvements to its OCR as well as graphical interfaces for the management of the platform.
Articles as elementary parts of information in newspapers
Newspapers are periodical publications containing news and feature articles, generally printed on low-quality paper. The link among the pieces of information they contain is the periodicity itself: news items are relevant to the day or the week of the issue; feature articles may be relevant to a wider timespan but are still generally closely linked to the period of printing. The common thread is temporal, and this is how readers buying the newspaper consume it.
This immediate view of the present later becomes useful to other people interested in these past events: historians, both professional and amateur, genealogists, or simply individuals curious about local history. For them, however, the dates of printing and the issue itself as a container of articles often lack relevance, except as context. They want to retrieve information, and their research follows a theme: they may need to go through 20 articles spread across 20 issues over 10 years of publication to get the data they want. PIVAJ is a platform for archived digitized newspapers built to help people easily access the precise information they are looking for, even though it may be scattered across many pieces. Articles are the unit of this information.
Sections and articles are therefore the core materials that PIVAJ tools manage and produce, through extraction, analysis, transcription, indexing and visualization.
Newspapers layout
Historical newspapers have complex structures that may vary across the different layouts encountered. These structures provide much context that helps process the information contained in the newspapers, which is why we aim to extract and represent them correctly in PIVAJ. In general, the headings (of different levels) and the separators, which are detected during the automatic layout extraction step, are the most helpful clues for extracting the hierarchical structure of a newspaper. Newspapers may be composed of:
• A banner, which includes the name of the newspaper, and two ears (only on the title page)
• Headers, footers and margins
• Multiple sections (often delimited by horizontal separators)
Newspaper sections are the containers of the articles. They may span multiple pages of a single issue, and may have a heading, especially when they recur across issues. They are often divided into several columns by vertical separators. See Figure 1 for a sample composition of a title page.
Figure 1: Composition of a title page. Articles are in sections.
Articles also have complex and varying structures. As we define them in PIVAJ, articles may contain:
• A heading, which may span multiple lines
• A subheading, which is a heading of a lower level
• Parts, which are marked either by horizontal separators or by consecutive subheadings
• Paragraphs containing body text
• Illustrations, tables, captions...
• Annotations (signature, dates, footnotes)
© 2015 Society for Imaging Science and Technology
An article part may contain a heading, (sub)parts, paragraphs and other contents, just like an article does. Thanks to its recursive nature, our definition of an article is generic enough to be applied to a variety of historical newspapers.
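The recursive definition of an article can be illustrated with a small data structure. This is only a sketch under stated assumptions, not PIVAJ's actual data model: the class names `Part` and `Article`, their field names, and the choice to read a node's own paragraphs before those of its parts are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Part:
    """One node of the recursive article structure (hypothetical model)."""
    heading: Optional[str] = None
    paragraphs: List[str] = field(default_factory=list)     # body text
    illustrations: List[str] = field(default_factory=list)  # pictures, tables, captions
    annotations: List[str] = field(default_factory=list)    # signature, dates, footnotes
    parts: List["Part"] = field(default_factory=list)       # nested (sub)parts

    def depth(self) -> int:
        """Number of nesting levels below this node (0 if it has no parts)."""
        if not self.parts:
            return 0
        return 1 + max(p.depth() for p in self.parts)

    def all_paragraphs(self) -> Iterator[str]:
        """Yield body text depth-first: own paragraphs, then each part's.

        The ordering is an assumption made for this sketch.
        """
        yield from self.paragraphs
        for p in self.parts:
            yield from p.all_paragraphs()


@dataclass
class Article(Part):
    """An article is a part that may also carry a subheading."""
    subheading: Optional[str] = None


# Example: an article with two parts, the second holding a nested subpart.
article = Article(
    heading="Local news",
    subheading="Market day",
    paragraphs=["Intro paragraph."],
    parts=[
        Part(heading="First part", paragraphs=["Body A."]),
        Part(paragraphs=["Body B."], parts=[Part(paragraphs=["Nested."])]),
    ],
)
```

In this example `article.depth()` is 2, because one of its parts holds a nested subpart, and `all_paragraphs()` visits the text of every level without needing to know how deep the structure goes.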
However, in general, subparts (or deeper structures) are not needed. See Figure 2 for a sample composition of an article.
Figure 2: Article example, with different parts, subheading, illustrations etc.
Automatic layout extraction
Once digitized, a newspaper is an organized collection of pictures. From the computer's point of view, each of these pictures is just a collection of colored dots laid out row after row. This is far from the computational usability of native digital documents, such as a Word document or a Web page, which contain digital text and other textual metadata that allow indexing, searching, using multiple layouts etc. Several steps must be taken to give these digitized documents some of the properties of native digital documents.
First, PIVAJ must understand the layout of each page of a newspaper. To this end, it uses several statistical machine learning algorithms. Because these algorithms learn from examples, a certain number of representative pages must be provided with their detailed ground truth, that is, images in which all the relevant information (i.e. all the texts, headings with their appropriate level, the separators, the pictures, the captions etc.) has been delimited as blocks by hand. PIVAJ is then able to learn from these examples how to automatically label the different parts of the images as textual content (body text, headings of different levels, captions etc.) or graphical content (separators, pictures etc.). Pictures are detected separately using an algorithm based on Random Forests [1], working on grey-level histograms. Other content (text and separators) is detected by an algorithm based on Conditional Random Fields, as described in depth in [2]. These two analyses are then fused.
The steps of the automatic page layout extraction are as follows:
1. First, PIVAJ labels textual and graphical blocks, based on what it learnt from the ground truth examples.
2. From this labeling, it then builds a grid, assembling the visible separators and creating blank ones where needed (e.g. between columns of text).
3. Using this grid, it infers a first reading order, which allows it to use a second machine-learned discrimination inside text zones to recognize the headings with their appropriate levels.
Those parts are then assembled into sections and articles. However, this assembly may vary between newspapers, which is why we made it adaptive, using regular expressions that describe how the heading, body text, separator, and graphical blocks detected in the previous stages should be aggregated to form a section or an article. The full process (labeling and assembling) is automatic. Figure 3 gives an example of this process, from the initial page to the illustrative colored representation of the detected articles, including the underlying hierarchy of sections. Figure 4 details this process at the article level.
A graphical software tool is included in PIVAJ so that an operator may build the ground truth, manage the learning process, and then launch the analysis, without requiring any knowledge of the underlying image analysis and machine learning processes.
Figure 3: extracting the parts of the layout from the digitized image by